Overview
This report details the methods used to determine appropriate filter thresholds for SNV calls.
Creating simulated data
At the site level, three major filters were applied to obtain high-quality variants: variant quality normalized by read depth (QD), strand odds ratio (SOR) and Fisher strand bias (FS). To find the optimal filter thresholds, the following steps were taken. Note that the filter thresholds were optimized only for SNPs.
- A “truth” set of variants was created using either a subset of real variants from the 2018 release or randomly selected positions on the genome. The results were the same with either truth set, so only results from the former are shown below.
- The truth set of variants was inserted into N2.bam with bamsurgeon (20d431e).

- Variants were called with the wi-gatk-nf pipeline.
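As a rough illustration of the "randomly selected positions" option above, the sketch below draws distinct genomic positions with probability proportional to chromosome length. The chromosome lengths, the function name `sample_truth_positions` and the number of sites are hypothetical placeholders, not values from this report, and the output would still need to be converted to whatever input format the insertion tool expects.

```python
import random

# Hypothetical chromosome lengths (bp); replace with the real reference lengths.
CHROM_LENGTHS = {"I": 15_000, "II": 15_500, "III": 13_800}

def sample_truth_positions(chrom_lengths, n_sites, seed=0):
    """Draw n_sites distinct (chrom, pos) pairs, uniformly over the genome."""
    if n_sites > sum(chrom_lengths.values()):
        raise ValueError("n_sites exceeds the number of available positions")
    rng = random.Random(seed)  # fixed seed so the truth set is reproducible
    chroms = list(chrom_lengths)
    weights = [chrom_lengths[c] for c in chroms]
    sites = set()
    while len(sites) < n_sites:
        # Pick a chromosome in proportion to its length, then a 1-based position.
        chrom = rng.choices(chroms, weights=weights, k=1)[0]
        pos = rng.randint(1, chrom_lengths[chrom])
        sites.add((chrom, pos))
    # Sort by chromosome order, then by position.
    order = {c: i for i, c in enumerate(chroms)}
    return sorted(sites, key=lambda s: (order[s[0]], s[1]))

truth_sites = sample_truth_positions(CHROM_LENGTHS, n_sites=100)
print(truth_sites[:5])
```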
Optimize the QD filter
To reduce complexity, the filter thresholds were optimized one at a time. When optimizing the QD filter, for example, no other filters were applied.
The optimal QD threshold was determined as follows:
- For a given filter threshold (QD > 10 in the example below), variants called in step 3 (variant calling with wi-gatk-nf) that passed the filter were considered detected, and those that failed the filter were considered undetected.
| CHROM | POS | QD | sim1_genotype | sim2_genotype | sim3_genotype | ==> | pass_QD_filter | is_detected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I | 1352 | 110 | 1/1 | 1/1 | 1/1 | QD threshold is 10 | yes | yes |
| I | 2566 | 90 | 1/1 | 0/0 | 1/1 | QD threshold is 10 | yes | yes |
| I | 3847 | 2 | 0/0 | 1/1 | 0/0 | QD threshold is 10 | no | no |
| I | 4975 | 38 | 1/1 | 0/0 | 0/0 | QD threshold is 10 | no | no |
| I | 5590 | 298 | 1/1 | 1/1 | 1/1 | QD threshold is 10 | yes | yes |
- Each variant, depending on whether it was detected and whether it is in the truth set, falls into one of four categories: true positive, true negative, false positive or false negative.
| CHROM | POS | is_detected | is_in_truth | category |
| --- | --- | --- | --- | --- |
| I | 1352 | yes | yes | true positive |
| I | 2566 | yes | no | false positive |
| I | 3847 | no | no | true negative |
| I | 4975 | no | yes | false negative |
| I | 5590 | yes | yes | true positive |
- A confusion matrix can then be created using the variant counts for each category.
| | in_truth | not_in_truth |
| --- | --- | --- |
| detected | count of true positive | count of false positive |
| not_detected | count of false negative | count of true negative |
- Filter thresholds were chosen to maximize true positive rate and precision, while minimizing false positive rate.
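The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the report: the call set and truth set are toy values invented for the example, and the sketch assumes that only called sites are categorized (truth sites that were never called are not counted). The rates use the standard definitions: true positive rate = TP / (TP + FN), precision = TP / (TP + FP), false positive rate = FP / (FP + TN).

```python
from collections import Counter

QD_THRESHOLD = 10  # example threshold from the text

# Toy call set (illustrative values, not taken from the report): (CHROM, POS, QD)
calls = [
    ("I", 100, 25.0),
    ("I", 200, 31.5),
    ("I", 300, 4.2),
    ("I", 400, 7.9),
    ("I", 500, 18.3),
]
# Toy truth set: sites where a variant was inserted.
truth = {("I", 100), ("I", 400), ("I", 500)}

def categorize(is_detected, is_in_truth):
    """Map the two yes/no flags to one of the four categories."""
    if is_detected:
        return "true positive" if is_in_truth else "false positive"
    return "false negative" if is_in_truth else "true negative"

def ratio(num, den):
    """Safe division; returns 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0

counts = Counter()
for chrom, pos, qd in calls:
    is_detected = qd > QD_THRESHOLD       # passes the QD filter
    is_in_truth = (chrom, pos) in truth   # present in the truth set
    counts[categorize(is_detected, is_in_truth)] += 1

tp, fp = counts["true positive"], counts["false positive"]
fn, tn = counts["false negative"], counts["true negative"]

tpr = ratio(tp, tp + fn)        # true positive rate
precision = ratio(tp, tp + fp)
fpr = ratio(fp, fp + tn)        # false positive rate

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"TPR={tpr:.2f} precision={precision:.2f} FPR={fpr:.2f}")
```

With this toy data the confusion matrix holds 2 true positives, 1 false positive, 1 false negative and 1 true negative. Repeating the calculation over a range of thresholds gives the values from which a threshold can be chosen.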
Optimize the SOR filter
- The same steps were taken to find the optimal SOR threshold.
Optimize the FS filter
- The same steps were taken to find the optimal FS threshold.
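Because the procedure is the same for each filter, it can be written once and reused with a different annotation and comparison. The sketch below is one hypothetical way to do that. It assumes that SOR and FS filters pass a call when the value is below the threshold (high values indicate strand bias), the reverse of QD; the report does not state the direction explicitly. The function name `sweep`, the toy calls and the candidate thresholds are all invented for illustration.

```python
import operator

def ratio(num, den):
    """Safe division; returns 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0

def sweep(calls, truth, field, thresholds, passes):
    """Return one (threshold, TPR, precision, FPR) tuple per candidate threshold."""
    rows = []
    for t in thresholds:
        tp = fp = fn = tn = 0
        for call in calls:
            detected = passes(call[field], t)
            in_truth = (call["CHROM"], call["POS"]) in truth
            if detected and in_truth:
                tp += 1
            elif detected:
                fp += 1
            elif in_truth:
                fn += 1
            else:
                tn += 1
        rows.append((t, ratio(tp, tp + fn), ratio(tp, tp + fp), ratio(fp, fp + tn)))
    return rows

# Toy calls (illustrative values only, not from the report).
sor_calls = [
    {"CHROM": "I", "POS": 100, "SOR": 0.7},
    {"CHROM": "I", "POS": 200, "SOR": 4.1},
    {"CHROM": "I", "POS": 300, "SOR": 2.2},
    {"CHROM": "I", "POS": 400, "SOR": 6.0},
]
sor_truth = {("I", 100), ("I", 300)}

# Assumption: a call passes when SOR < threshold, hence operator.lt.
# For QD the same function would be called with operator.gt.
sor_rows = sweep(sor_calls, sor_truth, "SOR", [1, 3, 5], operator.lt)
for row in sor_rows:
    print(row)
```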
QD, SOR, FS filter in combination
- Different combinations of QD, SOR and FS thresholds were then compared. The thresholds determined by optimizing each filter individually, as described above (red point), are very close to the optimal combination of thresholds.
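A comparison of threshold combinations can be sketched as a small grid search. The report does not say how true positive rate, precision and false positive rate were weighed against each other, so the score used here (TPR + precision - FPR) is only a stand-in. The toy calls, the candidate thresholds and the pass directions (QD above, SOR and FS below the threshold) are likewise assumptions made for the example.

```python
from itertools import product

# Toy calls (illustrative values only, not from the report).
calls = [
    {"CHROM": "I", "POS": 100, "QD": 25, "SOR": 1.0, "FS": 5},
    {"CHROM": "I", "POS": 200, "QD": 12, "SOR": 2.0, "FS": 10},
    {"CHROM": "I", "POS": 300, "QD": 22, "SOR": 4.5, "FS": 40},
    {"CHROM": "I", "POS": 400, "QD": 6, "SOR": 1.5, "FS": 2},
    {"CHROM": "I", "POS": 500, "QD": 15, "SOR": 0.8, "FS": 20},
]
truth = {("I", 100), ("I", 200), ("I", 500)}

def ratio(num, den):
    """Safe division; returns 0.0 when the denominator is zero (a convention)."""
    return num / den if den else 0.0

def evaluate(qd_min, sor_max, fs_max):
    """Return (TPR, precision, FPR) when all three filters are applied together."""
    tp = fp = fn = tn = 0
    for c in calls:
        detected = c["QD"] > qd_min and c["SOR"] < sor_max and c["FS"] < fs_max
        in_truth = (c["CHROM"], c["POS"]) in truth
        if detected and in_truth:
            tp += 1
        elif detected:
            fp += 1
        elif in_truth:
            fn += 1
        else:
            tn += 1
    return ratio(tp, tp + fn), ratio(tp, tp + fp), ratio(fp, fp + tn)

def score(combo):
    """Stand-in objective (an assumption): reward TPR and precision, penalize FPR."""
    tpr, precision, fpr = evaluate(*combo)
    return tpr + precision - fpr

# Candidate thresholds for QD, SOR and FS (hypothetical values).
grid = list(product([5, 10, 20], [3, 5], [30, 60]))
best = max(grid, key=score)  # ties resolve to the first combination in grid order
print(best, evaluate(*best))
```

The combination found this way can then be compared with the thresholds obtained by optimizing each filter on its own.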
